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Abstract 

We study the problem of estimating the time delay between two signals represent- 
ing delayed, irregularly sampled and noisy versions of the same underlying pattern. 
We propose and demonstrate an evolutionary algorithm for the (hyper)parameter 
estimation of a kernel-based technique in the context of an astronomical problem, 
namely estimating the time delay between two gravitationally lensed signals from a 
distant quasar. Mixed types (integer and real) are used to represent variables within 
the evolutionary algorithm. We test the algorithm on several artificial data sets, and 
also on real astronomical observations of quasar Q0957-I-561. By carrying out a sta- 
tistical analysis of the results we present a detailed comparison of our method with 
the most popular methods for time delay estimation in astrophysics. Our method 
yields more accurate and more stable time delay estimates: for Q0957+561, we ob- 
tain 419.6 days between images A and B. Our methodology can be readily applied 
to current state-of-the-art optical monitoring data in astronomy, but can also be 
applied in other disciplines involving similar time series data. 
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1 Introduction 



The estimation of time delay, the delay between arrival times of two signals 
that originate from the same source but travel along different paths to the 
observer, is a real-world problem in Astronomy. A time series to be analysed 
could, for instance, represent the repeated measurement, over many months 
or years, of the flux of radiation (optical light or radio waves) from a very 
distant quasar, a very bright source of light usually a few billion light-years 
away. Some of these quasars appear as a set of multiple nearby images on the 
sky, due to the fact that the trajectory of light coming from the source gets 
bent as it passes a massive galaxy on the way (the "lens"), and, as a result, the 
observer receives the light from various directions, resulting in the detection 
of several images [29.46J. This phenomenon is called gravitational lensing, and 
is a natural consequence of a prediction of the General theory of Relativity, 
which postulates that massive objects distort space-time and thus cause the 
bending of trajectories of light rays passing near them. Quasars are variable 
sources, and the same sequence of variations is detected at different times in 
the different images, according to the travel time along the various paths. The 
time delay between the signals depends on the mass of the lens, and thus it is 
the most direct method to measure the distribution of matter in the Universe, 
which is often dark P3|l29] . 

In this scenario, the underlying pattern in time of emitted flux intensities from 
a quasar gets delayed and corrupted by all kinds of noise processes. For ex- 
ample, astronomical time series are not only corrupted by observational noise, 
but they are also typically irregularly sampled with possibly large observa- 
tional gaps (missing data) [33|40|32|[27] . This is due to practical limitations of 
observation such as equipment availability, weather conditions, the brightness 
of the moon, among many other factors [T7] . Over a hundred systems of lensed 
quasars are currently knowiH], and about 10 of these have been monitored for 
long periods, and in some of these cases, the measurement of a time delay has 
been claimed. Here we focus on Q0957-I-561, the first multiply-imaged quasar 
to be discovered [51]. This source, which has a pair of images (here referred to 
as A and B), has been monitored for over twenty years, and despite numerous 
claims, a universally agreed value for the time delay in this system has not 
emerged [5U|IT^ . 

In an earlier paper, we presented an analysis of repeated radio observations, 
along with simulated data generated according to the properties of these ob- 
servations [H], to show that a kernel-based approach can improve upon the 
currently popular methods of estimating time delays from real astronomi- 



^ A growing list of multiply-imaged gravitationally lensed quasars can be found at 
ht tp : / / cf a- www, harvard . edu/cast les . 
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cal data. The more common form of observations, however, employs optical 
telescopes for monitoring known multiply-imaged sources, and these obser- 
vations have inherent problems that require the modification of our previous 
approach. Here we present a largely modified approach that outperforms on 
optical datasets our previous appraoch, as well as alternative approaches in 
use in astrophysics. 

Here we introduce a novel evolutionary algorithm (EA) to estimate the pa- 
rameters of a model-based method for time delay estimation. The EA uses, as 
a fitness function, the mean squared error (MSEf;^) given by cross-validation 
on observed data, and also performs a novel regularisation procedure based 
on singular value decomposition (SVD). Our population is also represented by 
mixed types, integers and reals. 

The contribution of this paper is in several directions: i) an evolutionary al- 
gorithm has been introduced to form a novel hybridisation with our kernel 
method, ii) a principled automatic method has been proposed to estimate the 
time delay, kernel width, and SVD regularisation parameters, iii) the applica- 
tion of EA driven by a model based formulation to a real-world problem, and 
iv) we carefully study statistical significance of the results on different data. 

Our EA is an evolutionary optimisation technique in presence of uncertainties 
[28] and missing data with mixed representation - through two linked popu- 
lations, each devoted to one particular data type. The parameters to optimise 
come from a kernel machine. We do parameter optimisation and model selec- 
tion at the same time. This approach can be applied to other problems, not 
only time series from gravitational lensing. For instance, the missing data prob- 
lems cover those cases where instrumental equipment fails, observations are 
incorrectly recorded, sociological factors are involved, etc. Therefore, the data 
are unevenly sampled, which restricts the use of Fourier analysis |42j(§13.8). 
Problems with noisy and missing data occur in almost all sciences, where the 
data availabihty is influenced by what is easy or feasible to collect (e.g., see 
[T21I6] ). 

We compare the performance of our EA in several ways: 

(1) The performance of our method is assessed against that of two of the 
most popular methods in the astrophysical literature [50|18ll33|ll|22j . i.e., 
(a) the Dispersion spectra method [36|37|35pl] and (b) a scheme based 
on the structure function of the intrinsic variability of the source, here 
referred to as the PRH method [¥T] . 

(2) Because the true time delay of observed fluxes from quasars is not known, 
we assess the performance of algorithms in a controlled series of experi- 
ments, where artificially generated data with known delays are used. We 
employ three kinds of artificial data sets: large scale data [2], PRH data 
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[iHTl] and Wiener data (as outlined in [23ll2i] l. 
(3) To justify our EA, an analogous non-evolutionary model-based approach 
(K-V) is also employed in this paper. 

Our statistical analysis shows that the results from our EA are more accurate 
and significant than state-of-the-art methods. We use our EA as well as a 
(l-l-l)-ES algorithm |35] on actual astronomical observations, where the twin 
images were observed over several years with optical telescopes [30j. 




The remainder of this paper is organised as follows: the data under analysis is 
described in §2J The kernel approach is outlined in §31 and the EA is presented 
in §H The results and our conclusions are in §51 and §3 respectively. Finally, 
our future work is presented in ^ 



2 Data 

2. 1 Optical Data 

In this paper, we use optical observation^ of the two images of the quasar 
Q0957-I-561, from a monitoring program at the Apache Point Observatory, 
New Mexico, USA [30j. This data set has 97 observations, where, in each 
observation, fluxes are measured of all the multiple images of the source, in 
the (7-band (a standard yellow-green filter), from December 1994 to July 1996. 

The observed time series (here called light curves) are given in Table [1], where 
the Time column, representing the time of observation (note that it is irreg- 
ularly sampled), is given in Julian days (JD, defined as the number of days 
since Noon GMT on January 1, 4713 BC). The fluxes observed from images 
A and B are given in the astronomical unit of magnitude (mag m), defined 
as m = —2.5 log^g /> "where / is the flux measured when observed through a 
green filtei0 (^f-band). In Fig. [T], the time series are shown. The measurement 
errors, which are standard deviations (std) of the flux measurement, are given 
in the Tabled] as Error A and Error B; these are the error bars. The source was 
monitored nightly, but many observations were missed due to cloudy weather 
and telescope scheduling. The big gap in Fig. [1] is an intentional gap in the 
nightly monitoring, since a delay of about 400 days, the pattern, was known 
'a priori' - monitoring programs on this quasar started in 1979. Therefore, the 
peak in the light curve of image A, between 700 and 800 days, corresponds to 
the peak in that of image B between 1,100 and 1,200 days. 

^ Astronomers observe quasars at other wavelengths as well, e.g., with radio tele- 
scopes [22j . 

^ This data set is available online [30] . 
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Table 1 

Optical data: Q0957+561 observed in the (7-band, from [30] 
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Optical data: 0957+561 
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Fig. 1. Observations of the brightness of the doubly-imaged quasar Q0957+561, in 
the 5f-band, as a function of time (Top: Image A; Bottom: Image B, see Tabled]). 
The time is measured in days (Julian days-2,449,000 days). 



2.2 Artificial Data 



Since the definite time delay for the Q0957+561 is unknown, one cannot test 
the accuracy of methods through real data. Therefore, many attempts have 
been made to generate synthetic data in order to test the performance of 
methods (e.g. [lT1l39]|7IIT7ll23f 24j ) . Below we describe three kinds of artificial 
data sets that we have used, representing the major classes of data sets used 
by others: large scale data, PRH data and Wiener data. 



- ^ -A 
■ B 
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Table 2 

Simulated Large Scale Data sets 
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Total = 7,701 data sets per underlying function. 
5 underlying functions yield 38,505 data sets. 

2.2.1 Large Scale Data 

In this data set (DS-5), the true time delay is 5 days \X5\, and the true offset in 
brightness between image A and image B is M = 0.1 mag. The intention here 
is to simulate optical observations as in Ovaldsen et al. [53]. We employ only 
the first five underlying function^. These data sets are irregularly sampled 
with three levels of noise and gaps of different size as shown in Table [2l for more 
details see p^|IT5] . We use 50 realisations per level of noise only. Consequently, 
this yields 38,505 data sets (see Table [2]), with 50 samples each, of which two 
are shown in Fig. [21 



2.2.2 PRH Data 

These data sets are generated by Gaussian processes, following [H], with a 
fixed covariance matrix given by a structure function according to Pindor [40j. 
The variance representing the measurement errors is 1 x 10^''. They are highly 
sampled with periodic gaps [17], simulating a monitoring campaign of eight 
months; yielding 61 samples per time series. There are seven true delays and 
100 realisations for each value of true delay [T^. Two plots are shown in Fig. [31 



2.2.3 Wiener Data 

These data sets, generated by a Bayesian model simulate three levels of 
noise with 225 data sets per level of noise, where each level of noise represents 
the variance: 0.1^, 0.2^ and 0.4^. The data are irregularly sampled and the 
true time delay in all cases is 35 days. Some examples are shown in Fig. [H 
Each time series has 100 samples. 



Plots are available at http://www.cs.bham.ac.uk/~jcc/artificial-optical/ 
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DS-5-1-G-0-N-0.dat with error bars of 0.106% 
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(b) 

Fig. 2. The simulated Large Scale Data sets, as outlined in TableO (a) The first un- 
derlying function (DS-5-1) without noise and no gaps. Error bars represent 0.106% of 
mag.(b) This data set corresponds to the same underlying function (DS-5-1), with- 
out noise, and the gap size is five (first realisation). Error bars represent 0.466% of 
flux. 



3 Kernel Approach 



In previous papers, [THIS], we have introduced a kernel-based approach, and 
we would refer the reader to these papers for further detail, since not all 
derivations will be repeated here at the same level of detail. The aim of this 
section is to come up with the parameters to evolve in ^ 

We model a pair of time series, obtained by monitoring the brightness of 
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Artificial Data; PRH data true delay=34 
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Artificial Data; PRH data true delay=66 
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(b) 

Fig. 3. Examples of the simulated PRH Data. The error bars represent a variance 
of 1 X 10"''. (a) This is a realisation for a true delay of 34 days. Image A has been 
shifted upwards by 0.08 for visualisation, (b) In this realisation, the true delay is 
66 days. Image A has been shifted upwards by 0.1 for visualisation. 

images A and B (see as 



XA{ti) = hA{ti) +eA{ti) 
XB{ti) = hB{U)eM + eB{ti), 



where = {x, — } denotes either multiplication or subtraction, so M is either 
a ratio (used in radio observations, where brightness is quoted in fliix units) or 
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(b) 

Fig. 4. Simulated Wiener Data, (a) The first realisation of data sets with noise of 
variance 0.1^. Image A has been shifted upwards by a factor 1.5 for visualisation, 
(b) The first realisation of data sets with noise of variance 0.4?. Image A has 
been shifted upwards by 2.9 for visualisation. In each case, error bars represent the 
standard deviation. 

an offset between the two images (as in optical observations, where brightness 
in represented in logarithmic units). We use the latter option here. Values of 
the independent variable ti,i = 1,2, ...,n represent discrete observation times. 
The observation errors £^(^1) and SB{ti) are modelled as zero- mean Normal 
distributions 



N{0,aA{ti)) and N{0,aB{U)), 



(2) 
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respectively, where and (TBiU) are standard deviations. Now, 

N 

hA{U) = ^j^i'^j^ ^i) (3) 
i=i 

is the "underlying" light curve that underpins image A, whereas 

N 

hBiti) = J2a,K{c,+A,ti) (4) 

is a time-delayed (by A) version of hA{ti) underpinning image B. 

The functions hA_ and Hb are formulated within the generalised linear re- 
gression framework [251H7] . Each function is a linear superposition of N ker- 
nels K{-,-) centred at either cj, j = 1,2,...,N (function f^), or cj + A, 
j = 1,2, ...,N (function /b). We use Gaussian kernels of width Uc- for c, t G 9ft, 



The kernel width > determines the 'degree of smoothness' of the models 
Ha and Hb ■ We position kernels at the position of each observation, implying 
N = n. The width utj = ujc is determined through the k nearest neighbours of 



K{c, t) = exp 




(5) 



Cj (equal to tj) as 




(6) 



The weights a (l3])-(j4]) are obtained as follows [14j: 
Kd = X, 



(7) 



where a = (ai, 02, aN)"^, 



Ka{;-) 



XA{-)/(rA{-) 



K 



X = 



8 



XB{-)/crB{-) 



and the kernels Ka{-, ■), Kb{-, ■) have the form: 



KA{c,t) = ^^^, KB{c,t) = 



MeK{c + A,t) 

(TBit) 



(9) 
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Hence, 



(10) 



Our aim is to estimate the time delay A between the temporal light curves 
corresponding to images A and B. Typically, A is estimated by a set of trial 
time delays in the range [A^m, ^max] with a specific measurement of good- 
ness of fit [H]. In Eq. [TOl the superscript "+" represents a pseudoinverse of 
a matrix, the pseudoinverse rather than the inverse is required because the 
matrix is not squared, that is, an over-determined system is involved [12]. 

Finally, the parameters are the time delay A [as given in Eqs. ([3]) & dl])], 
the variable width k [as in Eq. ([6])], and the regularisation parameter 6 (see 
below) . 



3. 1 Regularisation 

—* — * 

In practice, the matrix K ([8]) may be singular because K is an over-determined 
system, and noisy time series ([2]) are involved. We therefore regularise the in- 
version in flTU]) through singular value decomposition (SVD) [I^. To avoid 
singularity, the most straightforward method is to find a threshold A for sin- 
gular values [12][2T] . This means that the singular values less than A are set to 
zero, following which f IlOp is obtained through SVD (12] . 

In other words, A tells us how many singular values to set to zero. Hence, for 
a given A, the number of singular values to keep may vary. We illustrate this 
through Fig. O representing artificial and optical data, where 9 is the number 
of singular values to set to zero. One can see a well defined pattern in the 
range 6 = [15, 27] (A = 5) in Fig. [ii, and 6 = [49, 72] (A = 419) in Fig. [5)d. 
Thus, if one can find a proper A that falls in this range, then one can claim 
that the estimation of A is "robust" . But the range of this pattern may change 
for other M and k parameters, in which case there is no guarantee that the 
estimated A falls in this range. Furthermore, no matter which method is used 
for assessing the goodness of fit, if we test A in a specific range with a fixed 
A, then we may come up with different values of ^ - some inside the pattern, 
some outside, none inside, etc. 

Instead of A, we use 6 clS db TQ gularisation parameter in §11 In fact, we aim to 
create an automatic algorithm that performs a global search for all parameters, 
and then finds the proper 6 that falls in the pattern; with our EA, we have 
done this (see A review of other general regularisation techniques for 
inverse problems can be found in Conan |13j . 
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Fig. 5. Patterns. In each relation (A, 6), the best time delay has been plotted, 
i.e., the best A (y-axis) versus the number of singular values 6 set to zero (x-axis). 
The best time delay is found through log-likelihood [T3| by evaluating time delay 
trials in a given range, (a) DS-5-1-G-0-N-0. This data set has no noise and no 
gaps; A = [0, 10] with increments of 0.1; M is set to its true value M = 0.1 (for 
more details concerning these data, see §2.2.ip . The pattern is at = [5,27], where 
A = 5 (true value), (b) Actual g-hand optical observations of quasar Q0957+561, 
A = [400,450] with unitary increments. The data set is as in §2.11 M was set to 
0.117 [30] and A; = 3 ([6]). The pattern is at 6* = [49, 72], where A = 419 (see Tables E] 
anddl and gS]). 



3. 2 Optimisation 



The above formulation can be seen as an optimisation problem where the 
variables are A, k and 6. Conventional gradient-based optimisation techniques 
cannot be used since the above kernel-based approach is not differentiable with 
respect to the discrete variables k and 9, regardless of the loss function. 

Of course, since both k and 6 are finite range discrete variables, one can employ 
a brute force search driven by cross-validation. Apart form having to deal in a 
systematic and time-consuming manner with a huge search space, it is also not 
clear what the appropriate ranges and resolutions should be for parameters 
such as A or k. In any case, we compare our EA approach (see section Q with 
a non-evolutionary kernel-based approach (K-V) for estimating time delay A 
by cross- validating k and the regular isat ion parameter A p^^p5] . 

An example of the search landscape is in Fig [HI We use A^m = 400 and 
^max = 450 with unitary increments, and 6* = 1,2, ...,n, where n = 97. The 
parameter k is fixed to 3, and the offset M to 0.117. The real optical data 
((yf-band) in ^is used. Then, Algorithm [1] (explained later in Q is applied to 
obtain the MSE(7y. Note that this landscape may change for other k values 
and for other data sets. We can see that from 6' = 80 to n the error surface 
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k = 3; A = 420; 9 = 58; log(MSE=0.001 9291 )+1 0=3.7493 
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Fig. 6. Example of Search Landscape. The data is from the doubly-imaged quasar 
Q0957-I-561, in the g-h&nd. The parameters k was fixed to 3, and the offset M to 
0.117. Algorithm [1] is used to obtain the log(MSE); see ^ The surface has been 
shifted upwards by 10 units for visualisation. 



is quite complicated for simple search algorithms, e.g., gradient descent (if 
differentiable), hill climbing or simulated annealing search. There are also 
more local minima when 6 < 45; see also Fig. [51 In the 6- A plane, the mark 
(x) shows the best parameter combination; i.e., minimum MSE(7y. To smooth 
the surface and help the visualisation, we use a logarithmic scale. 



4 Evolutionary Algorithm (EA) 



Following the kernel-based approach in ^ we have three parameters: (i) the 
time delay A; (ii) the variable width k; and (iii) the number of singular values 
to retain 6. Besides, we have a measurement of fitness (objective function), 
e.g. log-likelihood or a loss function, which, along with the others, gives us a 
third- dimensional search space $. Therefore, we follow an EA to avoid local 
minima [MfiS] . 
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Algorithm 1. Fitness function (A^, k^, Ox) 
Blocks ^ 5 

PointsPer Block <— n/ Blocks 
for i 1 to P ointsP er Block 

{ 

Remove the i*^ observation of each block and include it 

in the validation set V 

Compute fiA and hs for the training set T 

Get MSEcy on the validation set V 

R{i) ^ MSEcF 

} 

fx ^ mean{R) 
return f^ 



Let us define as our population 



Ai 01 ki 


fl 


As 62 k2 


/2 


A3; Ox kx 


fx 


A f) h 
i—Xp Up rvp 


fv. 



where each row in Pi is a hypothesis commonly referred to as individual or 
chromosome, which is a set of parameters {A^, Ox-, kx}-, randomly initialised. 

Then we have p hypotheses [31]. Each hypothesis x is evaluated by fx-, which 
is a measure of fitness. For this, we use the mean squared error (MSEf;^) given 
by Cross Validation (CV), in Algorithm [T], where T = O — V is the training 
set, O is the set of all observations such as O = {{xA{ti),XB{ti))\i}, and V is 
the validation set (the log-likelihood or simple mean squared error can also be 
used, but this might lead to overfitting). Thereafter, we apply artificial genetic 
operators such as selection, crossover, mutation and reinsertion to generate 
P2,---,Pg populations. At the g generation, we choose from Pg the best set 
of parameters (or individual) according to its fitness; i.e., with minimum fx- 
This procedure is summarised in Algorithm [21 and the details are in §4.1[ 

This process leads to artificial evolution, which is a stochastic global search and 
optimisation method based on the principles of biological evolution [20l|^ . 
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Algorithm 2. Evolutionary Algorithm 
(See \4-l\ for details) 
Initialise population Pi = [P/ Pf ] 
Evaluate population Pi with Algorithm [1] 
for ^ ^ 2 to ^ 

{ 

Select P/' and P/ from Pi 

Recombine P/', Recombine P/ 

Mutate P|' Mutate P/ 

Evaluate i^' = [p|' Pf ] with Algorithm [D 

Reinsert P^' into P^.i to obtain P^ 

} 

^.i Representation and Evolution Operators 



We represent every population Pi in every generation i = l,2,...,g, as two 
linked populations of the same size p, P^ = [P/ P/]. The x-th individual 
in population P^ corresponds to the x-th individual in populations P/ and 
P/. The population P/ uses reals to represent A^^, while P/ employs inte- 
gers to represent 6x and kx. First we initialise randomly Pi and evaluate the 
population with the above fitness function. Second, we select half of the pop- 
ulation Pi for reproduction. This selection of individuals is then applied to 
each sub-population P/, P^, i.e. the indexes of selected individuals in both 
sub-populations are the same. We use roulette wheel selection to obtain P/' 
and P/. Third, we apply individually recombination and mutation on P/' and 
P/. Finally, we evaluate the new linked population P^' = [P/' P/] to obtain its 
fitness and perform reinsertion of offsprings between Pi and P^' (elitist strat- 
egy). We repeat the above procedure until Pg is obtained. Note that Pi to Pg 
have always the same size. 

We use linear recombination and mutbga as mutatiorf^ (as in Breeder Genetic 

Algorithm [10]) for Pi', and double-point crossover and discrete mutation for 

Pi^'. In both cases, we use 0.5 as mutation rate. We employ a population size 
of p = 300 individuals and g = 50 generations, unless other values are given. 
The above evolutionary algorithmic is what we refer hereafter as EA (with 
mixed representation unless otherwise stated). 

Moreover, we evolved the M parameter, and, rather than mixed types, we also 

^ We also tested Gaussian mutation, which leads to a similar performance. 

^ We use the Genetic Algorithm Toolbox for MATLAB [10, 9j . which is available 

online with good documentation. 
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tested only real representation with a single population performing two kinds 
of flooring for integers, in the population and in the fitness function. 



5 Results 

Here we present the results of the application of our evolved kernel approach on 
real and artificial data. We compare the performance of our EA, on the same 
data sets, against two of the most popular methods from the astrophysics 
literature: (a) Dispersion spectra method [5^37f35j . and (b) the structure- 
function-based method (PRH, [51]). In addition, we compare with the perfor- 
mance of our previous approach, based on kernels with variable width (K-V) 
[H], which was applied to a different (radio) observational data set and one 
of the synthetic data sets (large scale data only). 

Two versions of Dispersion spectra are used; D\ is free of parameters [36)137] 
and D\ ^ has a decorrelation length parameter 5 involving only nearby points 
in the weighted correlation [371135] . For the case of the PRH method, we use 
the image A from the data to estimate the structure function ^T] . In the last 
subsection, we compare EA against a Bayesian method on data sets generated 
by this Bayesian approach [23] . 

In the first section below, we present the results of our analysis of the real 
observational data, followed by the results from the various synthetic data: 
large scale, PRH and Wiener data (see §2]) 

5.1 Astronomical observations 

Here we use the observational optical data outlined in §2.1[ We begin by 
showing the results of our EA evolving M with real representation only. The 
integer parameters are floored at fitness function. For M , and each parameter 
in Eq. (ITTi) . we define the following general bounds: A = [400, 450], k = [1, 15], 
9 = [l,n], and M = [0.10,0.20]. The results of ten runs (realisations) are 
given in Table [31 The set {A, M, 6, k} is the best solution (individual) at 
g = 50 according to fx (i.e., MSEcy). The column Convergence shows at 
what generation a stability has been reached, i.e., from what generation the 
MSEpy is constant. 

Of the continuous optimisation approaches, we tested one, (1+1)ES [15|, which 
is based on the Gray-code neighbourhood distribution, and uses real represen- 
tation. (1+1) means that one parent is selected and one child is produced in 
each single step of the Evolutionary Strategy (ES). We chose the (1+1)ES 
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approach because our fitness function is costly, so one expects to require fewer 
fitness evaluations than our EA. Rowe et al. [15] have shown superior per- 
formance of their (1+1)ES over Improved Fast Evolutionary Programming 
(IFEP), on some benchmark problems, and on a real- world problem (medi- 
cal tissue optics). IFEP is also a continuous optimisation approach [52]. For 
(H-l)ES, the precision is set to 200, and variable bounds set as above, allow- 
ing until 15,000 iterations. The convergence is reached after 14,410 iterations 
by using the same fitness function (Algorithm [T] in §1]), so we also floor at 
fitness function for integer variables. This ES yields A = 419.6, M = 0.1732, 
e = 58, k = 3, and MSEcy = 1.9249617 x 10"^ 

In Table HI we present ten runs, resulting from our EA with mixed types as 
discussed in §4.1[ Here, M is not evolved, being fixed to 0.117. The variable 
bounds are also set as above. Table [3] shows that regardless of the value of 
M, the time delay A is consistent, this justifies that M does not need to be 
evolved. Rather, we use the reported value M = 0.117 [3(J|. 

We point out that in Tables [3] and H] the EA suggests 6* = 58, which falls within 
the pattern in Fig. [5]d. 

In Table |3l we can see that the parameter M is not crucial in the time delay 
estimation. Therefore, in Table HJ we omit it. The results of these tables yield 
similar A estimates regardless of the representation (reals or mixed types). 

The results of the (1+1)-ES are also consistent, even though (1+1)-ES requires 
a larger number of iterations. On the one hand, for our EA, when g = 50 
(maximum number of generations), we perform 7,800 evaluations of the fit- 
ness function, because of our elitist strategy. On the other hand, (1+1)-ES 
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Evolutionary algorithm with mixed types: results on DSl 
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tends to converge in around 14,000 iterations across different initialisations. 
Every iteration corresponds to a fitness evaluation. (H-l)-ES demands more 
computational time (about twice as much) and therefore we do not use this 
algorithm to analyse artificial data. Since we use the same fitness function, 
one would expect to get similar performance to the EA. Moreover, a theo- 
retical analysis in multi-objective optimisation suggests a better performance 
of population-based algorithms (such as the EA used here), compared with 

(i+i)Es [ig. 

In the astrophysics literature, the best (smallest quoted error) previous mea- 
sures for this time delay can be found to be 417±3 days [SU] and 419.5±0.8 
days [15]. Therefore, the results in Tables [3] and H] are consistent. However, 
we think that the estimate of 417±3 days, from this data set, is underesti- 
mated because, for the quasar Q0957-I-561, the latest reports also give esti- 
mates around 420 days by using other data sets [33] • One is reminded that 
the gravitational lensing theory predicts that the time delay must be the same 
regardless of the wavelength of observation jl3|l29f46] . 



5.2 Artificial Data 

For the analysis of real data presented above, we do not know the actual value 
of the time delay. In order to evaluate the relative performance of various 
methods, we therefore present the analysis of synthetic data sets, produced 
from a set of known parameters. 
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5.2.1 Large Scale Data 

In all cases, the time delay under analysis is given by trials of values between 
^min = and Amax = 10 (also bounds in our EA), with increments of 0.1, 
where the ratio M is set to its true value 0.1. The parameter 6 is set to 5, 
for -Dig. When using the PRH method, we use bins in the range of [0, 10] 
for estimating the structure function from the light curve of Image A. In our 
EA, besides the above A bounds, we use the following bounds: 9 = [l,n], and 
k = [1, 15]. For K-V, we cross- validate k and A; the ranges are k = [1, 15] and 
A = [10-\ 10-2, lO-'^] (see M). 

Table presents the results for all time delay estimates; i.e., rj = 38, 505. The 
best results are highlighted in bold. Regarding the statistics in Table O let 
Aj, j = 1,2, ...,1], be estimated time delays, where r] is the quantity of time 
delay estimates. The empirical mean is 

A = ^E4' (12) 
'/ j=i 



and the empirical standard deviation is 



a 



(13) 



The estimators /t and a are used to estimate the bias and variance of time 
delay estimates, respectively. The mean squared error is given by 



MSE = -X:(A,-^o)', 

V j=i 



(14) 



where /iq is the true time delay. The average of absolute error is 

AE= -X:|A, -/io|. (15) 



V 



The 95% confidence intervals (CI) for fi are given by /t ± 1.96 x a/y^, where 
the constant depends on the desired confidence level and the sample size; e.g., 
see Table Ilia in [3] . 

We also performed a t-test on these results, where the hypothesis to test is 
-^0^ f^o = 5. The results are shown in Fig. [TJ where the estimates are grouped 
by the underlying function, the level of noise and gap size. Since T follows a 
Student's t-distribution, which is centred at zero, those values close to zero are 
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Table 5 

Large scale data results: statistical analysis 
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Table 6 

Large scale data results: t-test 
Method 0% 0.036% 0.106% 0.466% 
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see ^5.2.11 for details 



statistically significant p^fTGfS] . The horizontal dotted line shows the threshold 
for a significance level of 95%, a = 0.05; i.e., when V < a, where V is known 
as the p-value. Thus, the threshold values for |T| in Fig. [7] are 2.2, 2 and 1.9 
for z/ = {9,49,499}, degrees of freedom, respectively (see Table [2]). 

In Table [6], we show the number of cases that satisfy the above threshold values. 
In other words, see Fig. [7] and count the number of points below the horizontal 
dotted line per level of noise- the significant results. In Table El the results 
are grouped by noise level only, and the best ones are highlighted in bold. 
We also tested the significance of time delay estimates with nonparametric 
hypothesis testing, such as sign test and Wilcoxon's signed-rank test [3J, with 
similar results. 

Like Table El Table [7| shows the quantity of cases where the true delay, A = 5, 
falls within the 95% CI. The results are also grouped by noise level. In Fig. [HI 
we illustrate the 95% CI for DS-5-1, i.e., one underlying function with 0% of 
noise only. 

In Fig. [9] are shown the results of MSE (fT^ . where the estimates are grouped 
by level of noise as in the previous figure - Fig. [71 Here, for low levels of 
noise: 0% and 0.036%, the asterisk (the proposed EA) has and outstanding 
performance because the MSE is close to zero. The AE statistic ([T^ gives 
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Fig. 7. The t-test results applied to artificial Large Scale data. Each row corresponds 
to a different underlying function (DS-5-l,DS-5-2,...,DS-5-5), and each column cor- 
responds to a different level of noise (0%, 0.036%, 0.106% and 0.466%). Every plot 
shows the results of |T| from five methods; i.e., Df, -D| 2' PRH) K-V and EA; shaded 
point, circle, diamond, triangle and asterisk respectively. Note that all the plots have 
the same scale on the y-axis. See §5.2.11 for details. 

Table 7 

Large scale data results: 95% CI 



Method 0% 0.036% 0.106% 0.466% 
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similar results with this grouping (not shown). 

From Table [5l the best results are for D^, K-V and EA. Since the noise is about 
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Fig. 8. 95% Confidence Intervals on Large Scale Data, DS-5-1 with 0% of noise only. 
This plot shows the 95% CI from five methods: Dj, Dl^, PRH, K-V and EA. The 
PRH intervals are not visible here because are below the bound of 4.5 days (see 
Table [5]). We use this bound for visualisation purposes. The horizontal and dotted 
line is located at the true delay, A = 5. 



0.01 mag (< 0.106%) (standard deviation) in the observational optical data, 
we are interested in exploring the effects of various levels of noise. Therefore, in 
Table [HI we show the results of the estimates grouped by noise level, regardless 
of the gap size. The best results are highlighted in bold. As in Tables [6] and 
[71 and Figure [HI the results from EA are promising. 

Finally, we compare the performances of methods K-V and EA through 
paired tests on time delay estimates and MSE. As an example, in Fig. [10], we 
show K-V against EA with paired t-test. The bars represent the mean estima- 
tor fi—fio for each method, where fi is the mean of time delay estimates, and /xq 
is the true time delay. At the top of each plot appears either a circle or a plus 
symbol representing K-V and EA, respectively. If MSE^-y < MSE^;^ then 
a circle appears, and we display the plus symbol when MSE^-y > MSE^;^!. 
The asterisk at the top means that the difference is significant at the level 
V < 0.05, i.e., 95% confidence level. In simple words, if the bar is large then 
the results are bad, because one is far from the true value = 5. Note that 
when the noise is low, 0%, 0.036% and 0.106%, the empty bars (EA) are small, 
that is, when the plus symbol (-I-) appears at the top. Therefore the results 
from EA are interesting. Moreover, note that the asterisk (*) at the top ap- 
pears in some cases. The asterisk means that the comparison is significant 
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Fig. 9. The use of MSE on artificial Large Scale Data. Each row corresponds to a dif- 
ferent underlying function (DS-5-l,DS-5-2,...,DS-5-5), and each column corresponds 
to a different level of noise (0%, 0.036%, 0.106% and 0.466%). Every plot shows the 
results of MSE statistic from three methods; i.e., Di, K-V and EA, indicated by 
shaded point, triangle and asterisk, respectively. 

when performing the paired t-test with a 95% confidence level. 

In Table M, we summarise the number of cases where V < 0.05, including 
paired sign test and paired-sample Wilcoxon signed-rank test. We also compare 
the results from with K-V and EA. For each comparison, the pairs are 
represented by Mi (o) and M2 (+). The columns corresponding to t-test show 
two quantities. The first one corresponds to the quantity of symbols "o" (Mi) 
with an asterisk also. The second one is the quantity of symbols "+" (M2) 
with asterisks. To obtain these numbers one should imagine a figure similar to 
Fig. [in] for the methods involved. The columns for sign test and signed-rank 
test are obtained in a similar manner; rather than t-test in Fig. [TOl were used 
other tests. Note again that the best results are from EA. 

Now lets summarise the results found so far. Table [5] contains the results of the 
analysis of these data sets over all time delay estimates regardless of noise and 
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Table 8 

Large scale data results grouped by noise level 
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Table 9 

Large scale data results: paired tests 
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gap size. Each row corresponds to a different statistic: 95% CI, MSE, AE, fi 
and a. The accuracy is measured by MSE and AE, where the best results are 
for K-V, which is followed by EA. Since fi and a are the mean and standard 
deviation over all estimates, these statistics can be seen as a measurement of 
bias and variance of time delay estimates over all data sets. Since the true 
delay is /xq = 5, the minimum bias is for Dl (|5.013 — /xo| = 0.013); in second 
place is EA (|5.015 — = 0.015). The minimum variance is for K-V (0.68); 
in second position is EA (0.79). 

However, in practice, the noise is about 0.01 mag (< 0.106%) for the optical 
data. In Tables E] to [HI and Figs. [3, M and [TUl the results are grouped by noise 
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Fig. 10. Paired t-test on Large Scale Data. The paired f-test is performed on time 
delay estimates from K-V and EA. Bars represent the estimator fi — ^q. At the top 
of each plot, a circle appears when MSEk-v < ^SEea, and the plus symbol when 
MSEi^_y > MSEea- The asterisk at top means that the p-value resulting from 
paired t-test is less than 0.05, i.e., 95% confidence level. 



level. Tables [6] and [7| suggest that the results from EA are more statistically 
significant than others (see §5.2. ip . Table [8] shows that the results are also 
better from EA, particularly when the noise is less than 0.106%. If the noise 
level is equal to 0.106%, depending on the statistic, the best performance is 
by either K-V or EA. When the noise is high (0.466%), the best results are 
from K-V. This can be also seen in Figs. [9] and [TOl 

The results from paired tests in Table [9] also suggest that the EA is capable of 
producing significantly superior time delay estimates, when compared to 
and K-V. 
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Table 10 

PRH data results: statistical analysis of the idealised method, PRH method with 
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5.2.2 PRH Data 



Now lets compare the performance of the proposed method EA with another 
kind of data. Here we use the artificial data generated by the PRH method- 
ology (see §5.2 in [H] and §6 in [14j); so we compare only the PRH method 
against EA. In fact, we compare the performance of EA with the PRH method 
by fixing the PRH parameters to those values used to generate the data (the 
ideal scenario). The structure function (SF) to define the covariance matrix in 
the PRH method is used in two ways: SF is fixed to its true value (SF*) and 
then estimated following the PRH method (SF-I-). 

Note that for these data there are several true time delays, ^q. In all cases, we 
use bound^ of /iq ± 30 days with unitary increments during the time delay 
analysis. The measurement error is also fixed at its true value (variance of 
1 X 10~^) for all methods. 

The results for the PRH method, SF* case, are in Table [TOl The column 
denotes the true time delay, which is also our hypothesis in the t-test, whereas 
rj = 100. The following columns are the statistics used in this analysis (see 
section §5.2. ip . The last row is the average (Avg). The results for the SF+ 
case and for EA are in Table [TT] and Table fT2] respectively. An analysis of bias 
(l/io — ft\) and variance (a) is in Table fT3l 

Results from paired tests are in Table [HI where the p- values are shown. The 
best values are in bold, i.e., V < 0.05. 

Tables [TOlfTSl show the results of the analysis applied to PRH data. Two ver- 
sions of the PRH method were used: SF* and SF-|- (see 15.2.21) . Here, contrary 
to the analysis of the large scale data, the noise level and gap size are fixed 

^ These bounds are also used to estimate the structure function SF+ by the PRH 
method. 
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Table 11 



PRH data results: statistical analysis of PRH method with SF+. 





r 


r 


95% CI 


CI range 


AE 


MSE 


34 


0.000 


-4.881 


18.9 -27.6 




8.72 


22.79 


593.45 


43 


0.210 


1.260 


81.8 - 48.2 




6.46 


13.51 


266.19 


49 


0.315 


-1.008 


43.7 - 50.7 




6.96 


15.15 


307.95 


59 


0.977 


-0.028 


54.6 - 63.1 




8.51 


19.66 


455.20 


66 


0.257 


-1.138 


58.7 - 67.9 




9.23 


22.21 


542.99 


76 


0.031 


-2.188 


66.7 - 75.5 




8.83 


20.95 


513.97 


99 


0.407 


-0.832 


93.0 - 101.4 




8.39 


18.84 


446.00 


Avg 


0.314 








8.16 


19.02 


446.54 


Table 12 

PRH data results: statistical analysis of EA. 




V 


r 


95% CI 


CI range 


AE 


MSE 


34 


0.600 


0.525 


32.3 - 36.8 




4.58 


7.68 


132.27 


43 


0.330 


0.977 


42.5 - 44.4 




1.96 


2.28 


24.34 


49 


0.389 


-0.864 


46.8 - 49.8 




2.94 


3.99 


54.69 


59 


0.684 


0.407 


57.2 - 61.9 




4.35 


7.28 


119.00 


66 


0.957 


-0.052 


63.9 - 67.9 




3.98 


7.16 


99.53 


76 


0.301 


-1.039 


73.0 - 76.9 




3.83 


6.89 


93.42 


99 


0.830 


0.214 


96.7 - 101.7 




4.94 


8.61 


153.43 


Avg 


0.585 








3.80 


6.27 


96.67 


Table 13 

PRH data results: Bias (^o ~ ) versus 


1 Variance (a). 






PRH with SF* PRH with SF+ 




EA 


IJ-o 


A 


a 


Bias fi 


a 


Bias 




a Bias 


34 


34.0 


2.3 


0.00 23.2 


21.9 


10.73 


34.4 


11.8 0.46 


43 


43.1 


4.7 


0.18 45.0 


16.2 


2.05 


43.7 


4.8 0.79 


49 


50.6 


7.0 


1.68 47.2 


17.5 


1.77 


48.4 


7.7 0.57 


59 


60.0 


5.5 


0.07 58.9 


21.4 


0.06 


59.9 


9.7 0.96 


66 


66.0 


2.5 


0.07 63.3 


23.2 


2.65 


65.7 


9.2 0.21 


76 


76.7 


3.8 


0.79 71.1 


22.2 


4.87 


75.1 


10.2 0.86 


99 


100.9 


7.2 


1.95 97.2 


21.1 


1.76 


99.8 


12.2 0.89 


Avg 




4.7 


0.81 


20.5 


3.41 




9.4 0.68 
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Table 14 



PRH Data results: 


paired tests 












paired t-test 


paired 


sign test 


paired signed-rank test 




or &l rji\ 


or -|- cZ rji\ 


SF* & EA 


SF+ & EA 


or &6 


or -|- 06 iL/A 




U.UOO 




0.483 


0.001'' 








0.364 


0.476 


0.057 


0.089 


0.004^ 


0.249 


49 


0.021'' 


0.567 


0.012" 


0.193 


0.002°^ 


0.425 


59 


0.919 


0.673 


0.012^* 


0.069 


0.284 


0.847 


66 


0.759 


0.358 


0.617 


0.483 


0.706 


0.358 


76 


0.112 


0.113 


5e-4^ 


0.368 


0.001^ 


0.145 


99 


0.457 


0.288 


0.920 


0.012'^ 


0.920 


0.238 



see ^5.2.21 for details. 

EA with minimum MSE. 
^ PRH method with SF* has the minimum MSE. 



(see Rather, we use different short delays in the range of 30 - 100 days. 

Comparing Tables fT0lfT2l the highest significance level (V) on /iq = {34, 43} 
is for the PRH method (SF*). EA has better significance levels than the PRH 
method for /io = {66,99}, even considering the idealised case. For CI range, 
MSE and AE statistics, SF* has the best performance on any fiQ. But, on all 
statistics EA performs better than SF+. Therefore, EA is more accurate than 
PRH method (SF+). 

In Table [121 we have fi and a as a time delay estimator and a measurement 
of uncertainty, respectively. Hence, the column Bias is given by |/io — Al- The 
last row has the average (Avg). Therefore, the minimum bias is for EA (on 
average). If we use a as variance measurement, the minimum variance is for 
PRH method (SF*). But, EA has lower variance than SF+. 

Table [14] shows that in few cases the paired difference between estimates is 
statistically significant (in bold). In those cases, EA is more accurate than 
PRH method (SF+), and in some cases EA is also more accurate than PRH 
method (SF*). 



5.2.3 Wiener Data 

Similarly, we compare the performance of EA with another kind of data (see 
These data come from a Bayesian approach [23j. We performed an analysis 
as above with r] = 225, so the hypothesis to test is Hq : fiQ = 35. 
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Table 15 

Wiener Data results: Statistical Analysis 

Data Set 



Method 


Statistic 


0.1 


0.2 


0.4 




V 


0.59 


0.63 


0.06 


Bayesian method 


T 


0.52 


0.48 


1.87 




MSE 


32.18 


9.43 


41.89 




AE 


1.84 


1.94 


3.7 






35.2 


35.1 


35.8 




a 


5.7 


3.1 


6.4 




V 


0.92 


0.76 


0.007 


EA 


T 


0.09 


0.30 


2.70 




MSE 


10.06 


zo.zo 


no 




AE 


1.76 


3.28 


5.72 






35.0 


35.1 


36.4 




a 


3.1 


4.8 


8.0 


Table 16 










Wiener Data results: Paired Tests 








Data Set 








Paired test 0.1 


0.2 


0.4 






t-test 0.633 


0.982 


0.264 






sign 0.505 


0.893 


0.007 






signed-rank 0.827 


0.766 


0.005 







The result^ are shown in Table [151 Since the data were generated by the 
Bayesian method, we aim to compare such a method with EA only. 

In Table [16] are the results from paired tests between Bayesian estimation 
method and our EA. We highlight in bold the p- values that are less than 0.05. 

In Table [T^l the EA results are more significant for data with noise levels of 
0.1 and 0.2, where V is 0.92 and 0.76 respectively. But for a noise level of 
0.4, it does not perform as well. In terms of bias, EA performs better for the 
data with a noise level of 0.1, and ties in the case of that with 0.2 data and 
on the noise of 0.4 data, the performance is not good enough. Talking about 
variance o", EA is better on data set of noise 0.1 only. It needs to be emphasised 
though, that these data was generated by the Bayesian estimation method, so 
the comparison is positively biased towards the Bayesian method. However, 



ji and a for the Bayesian method are not reported in detail in [23|24j 
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Table [16] suggests that most of the paired differences between estimates by 
both methods are not statistically significant. 

The poorer performance of the Bayesian method for low noise data, com- 
pared to medium noise is explained by the posterior sampler, which does not 
converge properly in some of the cases, giving biased estimates. This can be 
easily avoided when analysing any given data set, since the convergence can 
be assessed and the sampler re-run with different parameters. With the re- 
peated runs performed here, the same parameters for the sampler with all of 
the datasets were used. In the low noise case, the convergence measurement 
indicates that, for a number of cases, these have not been optimal. 



5.2.4 Loss Functions 

Talking about loss functions, we compared the K-V and EA methods on 
Large Scale data (see §5.2.ip . On the one hand, K-V employs the negative 
log-likelihood as loss function, where the parameters k and A are es- 

timated via cross-validation for trial time delays in the range — 10. On the 
other hand, the fitness function of EA is the MSE(7y given by cross-validation. 
We also compared the K-V method with EA by using the mean squared er- 



roi 



10 



as the measurement of goodness of fit for K-V, rather than the negative 
log-likelihood. Thus, the observational error is considered constant, because 
the negative log likelihood fitting criterion in K-V has the form 



'^^Si — ^) — ^ — ^) — ]■ 

For the K-V method, the negative log-likelihood cost function (fT6l) lead to 
more accurate estimates. 

Furthermore, we tested another fitness function on these data. Instead of 
MSEf7y, the fitness is given by the log-likelihood Q, where no cross-validation 
is performed, since k and 6 are also evolved. The results are that MSEcy 
performs better than Q, where Q is less time-consuming. 



5.2.5 Evolving Weights 

We also tried to evolve all the free parameters; that is, the weights a in (j3])- 
(jl]). This allowed us to avoid SVD in (fTOj) . which is 0{n^) (without cross- 
validation). However, the performance is poor because the number of param- 

^'^We point out that this mean squared error is different to MSEc^; i.e., Eq. p6|) 
without o"^(ti) and cr'^{ti) 
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eters to evolve increases with the number of samples n. In fact, we tested this 
approach on artificial data and the performance was inferior to that of our 
EA. This is due to an overwhelming number of variables. Typically, using evo- 
lutionary approaches, one can well optimise in about 30-dimensional search 
space |l9]. Perhaps a new framework can overcome this problem. 



6 Conclusions 

From observed optical monitoring data, we suggest 419.6 days as the best 
plausible value of the time delay between the two main images (A and B) of 
the distant lensed quasar Q0957+561. 

Regarding the artificial large scale data, the results from EA are important, 
because its accuracy in this application is matched to the precision and low 
levels of noise with which the current state-of-the art optical monitoring data 
are being acquired for multiple- image time delay measurement [TT]l33ll40pT] . 




From PRH data, we conclude that EA is more accurate than PRH method 
(SF+), and competitive with the idealised case (SF*). Unfortunately for the 
PRH method, the idealised case does not exist in practice. 

For the Wiener data, the conclusion is that for these data, EA outperforms 
Bayesian estimation method for low levels of noise (0.1 data). In other cases, 
depending on the statistic, EA can show better, equal or worse results. We 
stress that the current real optical obsrevational data are indeed characterised 
by low levels of observational noise. It is exactly in this context where our EA 
method outperforms the Bayesian estimation method. 

An evolutionary algorithm has been introduced to form a novel hybridisation 
with our kernel method, which is an automatic method to estimate the time 
delay, kernels width, and SVD regularisation parameters. It is a successful ap- 
plication of EA driven by a model based formulation to a real-world problem. 
The study of the statistical significance of results on different data shows that 
EA is a promising approach on gravitational lensing, in astrophysics, but it 
can also be applied in other disciplines involving similar time series data. 



7 Future Work 

One of the main issues to deal with, in future extensions of this work, is speed- 
ing up our EA, which potentially can be done in several ways. First, we would 
seek the optimum procedure to invert (fTUj) and its regularisation to deal with 
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ill-conditioning (see [3]). Alternative ways for model selection, rather than using 
cross-validation, which is O(n^), are desirable (see in §1]). The natural paral- 
lelisation of EA is another research line to follow; e.g., see P]. To speed up the 
EA, fitness approximation techniques may also help with this problem [HIIS^ 
since our fitness function is costly. Moreover, other evolutionary approaches 
in presence of uncertainties have not been tested [28||2] . 

The time delay problem is still a huge issue in astrophysics. The next gen- 
eration of large-scale monitoring projects with dedicated telescopes, LSST 



(http: //www.lsst.org/ ) and PAN-STARRS ( ]http://pan-starrs.ifa.hawaii.edu/| ) 
will produce monitoring data for thousands of potential multiply- imaged sources 
(as opposed to the 10 sources for which data are now available). These datasets 
will be available in a few years time, and a large effort is underway to develop 
algorithms that are far more sophisticated than the currently available ones 
that would deal with this data in an automated way, and can cope with the 
inevitable noise and gap features of the data. We are attempting to develop 
robust methods that will do this. 
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